Decision Trees and Random Forests

In this tutorial we build machine learning models based on decision trees and random forests.

First of all, import the libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Next, load the data with the pandas read_csv method and get a sense of what it is about. We got this data from lendingclub.com, where it is publicly available.
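As a rough sketch of this loading step: the snippet below reads a tiny in-memory CSV so that it runs on its own. The three rows are made up for illustration, and the file name loan_data.csv mentioned in the comment is an assumption, not something this tutorial specifies.

```python
import io
import pandas as pd

# Stand-in for the real file: three made-up rows with a few of the columns
# used in this tutorial. With the actual download you would call
# pd.read_csv("loan_data.csv") instead (the file name is an assumption).
csv_text = """credit.policy,purpose,int.rate,fico,not.fully.paid
1,debt_consolidation,0.11,720,0
1,credit_card,0.10,705,0
0,small_business,0.15,650,1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```

Calling df.head() right after loading is a quick way to confirm that the column names and values look the way you expect.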

There are 14 columns in the data. Let us see what they mean. The 13 listed below are the features; the 14th is the target variable, covered further down.

col_name: description
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Next, check the data for any missing values with df.info().
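A minimal sketch of the missing-value check, run on a small made-up dataframe rather than the real loan data. Besides df.info(), it also uses df.isnull().sum(), which is an addition here and simply counts the missing values per column.

```python
import pandas as pd

# Made-up rows for illustration only; this is not the LendingClub data.
df = pd.DataFrame({
    "purpose": ["credit_card", "educational", "all_other"],
    "int.rate": [0.10, 0.12, 0.14],
    "fico": [705, 690, 660],
})

df.info()                     # dtype and non-null count for every column
missing = df.isnull().sum()   # number of missing values per column
print(missing)
```

If every non-null count in the df.info() output equals the number of rows, the dataframe has no missing values.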

Yay! This dataframe has zero null values. Now let us divide the data into the target variable (y) and the other, independent variables; here our target variable is “not fully paid”. But one column, purpose, contains strings, and a model cannot be trained on strings directly, so that column has to be converted to numbers (or left out) first.
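One possible way to handle the string column and separate the target is sketched below. One-hot encoding with pd.get_dummies is an assumption about how to deal with purpose, the spelling not.fully.paid for the target column is assumed as well, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only. The column names follow the
# descriptions above; "not.fully.paid" is an assumed spelling of the target.
df = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card", "all_other"],
    "int.rate": [0.10, 0.12, 0.11, 0.14],
    "fico": [705, 690, 720, 660],
    "not.fully.paid": [0, 1, 0, 1],
})

# One-hot encode the string column so that no column holds strings.
# drop_first=True drops one category to avoid a redundant column.
encoded = pd.get_dummies(df, columns=["purpose"], drop_first=True)

X = encoded.drop("not.fully.paid", axis=1)   # independent variables
y = encoded["not.fully.paid"]                # target variable
print(X.columns.tolist())
```

Each remaining purpose category becomes its own indicator column, so the tree-based models later on only ever see numeric or boolean inputs.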

Now we split the data, keeping 70% of it for training and the remaining 30% for testing.
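The 70/30 split could look roughly like this with scikit-learn's train_test_split. The data comes from make_classification as a synthetic stand-in for the loan features, and the random_state value is an arbitrary choice made only so the split is reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan features (X) and target (y).
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)

# test_size=0.3 holds out 30% of the rows, leaving 70% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
```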

Now let us import the decision tree classifier and train the model.
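Here is a small sketch of the training step with scikit-learn's DecisionTreeClassifier. It again uses synthetic stand-in data, and the default hyperparameters are an assumption, since the tutorial does not state which settings were used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 70/30 as described above.
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Fit a single decision tree on the training portion only.
dtree = DecisionTreeClassifier(random_state=101)
dtree.fit(X_train, y_train)

print(dtree.get_depth(), dtree.get_n_leaves())   # size of the learned tree
```

A tree grown without a depth limit tends to memorize the training rows, which is one reason the test set is kept aside for evaluation.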

Now let us test the model on the held-out data and evaluate the results.
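The evaluation might be sketched as follows, using classification_report and confusion_matrix from sklearn.metrics. The choice of these two functions is an assumption, and because the data is synthetic, the numbers printed here say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 70/30 as described above.
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

dtree = DecisionTreeClassifier(random_state=101).fit(X_train, y_train)
predictions = dtree.predict(X_test)   # predicted labels for the unseen rows

report = classification_report(y_test, predictions)
cm = confusion_matrix(y_test, predictions)
print(report)   # precision, recall, f1-score and support per class
print(cm)       # rows: true class, columns: predicted class
```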

We have successfully built a model based on a decision tree, though its accuracy is only average. Now let us build a model based on random forests.
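A minimal sketch of building the forest with scikit-learn's RandomForestClassifier. The value 100 for the number of trees is a placeholder, since the tutorial does not state which integer was passed, and the data is once more a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 70/30 as described above.
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# n_estimators is the number of decision trees in the forest.
# 100 is an illustrative value, not the one used in this tutorial.
rfc = RandomForestClassifier(n_estimators=100, random_state=101)
rfc.fit(X_train, y_train)

print(len(rfc.estimators_))   # one fitted tree per estimator
```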

Here we have passed an integer to RandomForestClassifier. This is its n_estimators parameter, the number of decision trees in the random forest, as we discussed in the definition of random forests.

Now let us evaluate the model.
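To put the two models side by side, something like the sketch below could be used. It trains both on the same synthetic stand-in split and reports accuracy and F1 score for each. The scores come from synthetic data, so they will differ from the ones for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 70/30 as described above.
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=101),
    # 100 trees is an illustrative value.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=101),
}

scores = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    # Store (accuracy, f1) measured on the held-out 30%.
    scores[name] = (
        accuracy_score(y_test, predictions),
        f1_score(y_test, predictions),
    )
    print(name, scores[name])
```

Evaluating both models on the same test rows keeps the comparison fair, because any difference in the scores then comes from the models and not from the split.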

Comparing the two models, the random forest shows a good improvement in F1 score and accuracy over the single decision tree.

This is the end of this tutorial about Decision Trees and Random Forests.

  • In the upcoming tutorials, we will learn about Support Vector Machines.